DOCUMENT RESUME

ED 082 656                                              HE 004 682

AUTHOR          Gillmore, Gerald M.
TITLE           Estimates of Reliability Coefficients for Items and
                Subscales of the Illinois Course Evaluation
                Questionnaire. Report #341.
INSTITUTION     Illinois Univ., Urbana. Office of Instructional
                Resources.
PUB DATE        Aug 73
NOTE            42p.

EDRS PRICE      MF-$0.65 HC-$3.29
DESCRIPTORS     Attitudes; *Course Evaluation; *Evaluation;
                Evaluation Methods; *Higher Education;
                *Questionnaires; *Student Opinion
IDENTIFIERS     *Illinois Course Evaluation Questionnaire



ABSTRACT

The major focus of this paper is on the reliability of the individual
items, the subscales, and the total score of Form 66 of the Illinois
Course Evaluation Questionnaire (CEQ). The CEQ is a Likert-type attitude
questionnaire designed to elicit evaluative information from students
about courses in which they are enrolled. Form 66 contains 50 items that
are combined by unweighted averaging to form six subscales and a total
score. "Stability" coefficients were estimated by three methods applied
to six samples. All estimates indicated reasonably high reliabilities
for classes. The magnitude of these reliabilities was discussed in the
context of standard errors of measurement, which were discussed, in
turn, within the context of norming. (Author)






ESTIMATES OF RELIABILITY COEFFICIENTS FOR ITEMS AND
SUBSCALES OF THE ILLINOIS COURSE EVALUATION QUESTIONNAIRE



Gerald M. Gillmore


Report #341, August 1973







Abstract 

The major focus of this paper is on the reliability of the individual
items, the subscales, and the total score of Form 66 of the Illinois Course
Evaluation Questionnaire (CEQ). The CEQ is a Likert-type attitude questionnaire
designed to elicit evaluative information from students about courses in which
they are enrolled. Form 66 contains 50 items, which are combined by unweighted
averaging to form six subscales and a total score.

"Stability" coefficients were estimated by three methods applied to six
samples. All estimates indicated reasonably high reliabilities for classes.
The magnitude of these reliabilities was discussed in relation to the number
of students in a class.

"Equivalence" coefficients were presented for all subscales and discussed.
Finally, reliabilities were discussed in the context of standard errors of
measurement, which were discussed, in turn, within the context of norming.



NOTE: The results of this paper can be used with CEQ Form 72 and Form 73 as 
it contains items and subscales found in Form 66. 



ESTIMATES OF RELIABILITY COEFFICIENTS FOR ITEMS AND
SUBSCALES OF THE ILLINOIS COURSE EVALUATION QUESTIONNAIRE

Gerald M. Gillmore

The major focus of this paper will be on the reliability of the
individual items, the subscales, and the total score of Form 66 of the
Illinois Course Evaluation Questionnaire (CEQ). Various methods of
estimation will be discussed, and estimates using each method will be
presented. Reliability estimates for the six subscales and for the total
score will also be presented and discussed, as will estimates of the
standard errors of measurement, especially as related to item and subscale
norms.

The Illinois Course Evaluation Questionnaire (CEQ) 
The CEQ is a Likert-type attitude questionnaire designed to elicit
evaluative information from students about courses in which they are
enrolled. Form 66 of the CEQ contains 50 items. The items, numbered as
they appear on the form, are listed in Table 1. All items use the four
response categories: Strongly Agree, Agree, Disagree, Strongly Disagree.
Subsets of these items are combined by unweighted averaging to form six
subscales. The names of the subscales and the list of items which form
them are found in Table 2. A total score is also available by averaging
over all 50 items. Details of the development of the instrument can be
found in Spencer and Aleamoni (1969). The actual form can be found in
Appendix A.
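
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration, not the scoring program actually used. The item lists are those of Table 2, but the function name `score_form`, the dictionary layout, and the assumption that each response has already been converted to a number (with any reverse keying of negatively worded items already applied) are hypothetical, since the numerical coding is not described in this section.

```python
from statistics import mean

# Item numbers belonging to each subscale, as listed in Table 2.
SUBSCALES = {
    "G.C.A.": [2, 3, 11, 20, 25, 29, 34, 49],
    "M.I.": [1, 6, 8, 27, 36, 37, 48, 50],
    "C.C.": [13, 19, 26, 28, 30, 39, 40, 44],
    "I.A.": [7, 9, 14, 22, 24, 35, 45, 46],
    "Instr.": [5, 10, 12, 15, 18, 23, 31, 47],
    "S.I.": [4, 16, 17, 21, 32, 33, 38, 41, 42, 43],
}


def score_form(item_scores):
    """Return subscale and total scores by unweighted averaging.

    item_scores maps item number (1-50) to a numeric score. The numeric
    coding of the four response categories is an assumption of this sketch.
    """
    scores = {
        name: mean(item_scores[i] for i in items)
        for name, items in SUBSCALES.items()
    }
    # The total score averages over all 50 items.
    scores["Total"] = mean(item_scores[i] for i in range(1, 51))
    return scores
```

Because every item carries the same weight, a ten-item subscale such as S.I. and an eight-item subscale such as G.C.A. are placed on the same scale as the individual items.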

Reliability: What Is It?
Reliability has been defined in a number of distinct but related
ways. The thread of meaning throughout is the idea of "consistency"
(Rozeboom, 1966, p. 375) or "repeatable" (Nunnally, 1967, p. 172). Reliability
has a great deal of semantic overlap with the concept of "precision" as used



Table 1

The Items of the Illinois Course Evaluation Questionnaire

 1. I learn more when other teaching methods are used.
 2. It was a waste of time.
 3. Overall, the course was good.
 4. The textbook was very good.
 5. The instructor seemed to be interested in students as persons.
 6. More courses should be taught this way.
 7. The course held my interest.
 8. I would have preferred another method of teaching in this course.
 9. It was easy to remain attentive.
10. The instructor did not synthesize, integrate, or summarize effectively.
11. Not much was gained by taking this course.
12. The instructor encouraged the development of new viewpoints and
    appreciations.
13. The course material seemed worthwhile.
14. It was difficult to remain attentive.
15. Instructor did not review promptly and in such a way that students could
    understand their weaknesses.
16. Homework assignments were helpful in understanding the course.
17. There was not enough student participation for this type of course.
18. The instructor had a thorough knowledge of his subject matter.
19. The content of the course was good.
20. The course increased my general knowledge.
21. The types of test questions used were good.
22. Held my attention throughout the course.
23. The demands of the students were not considered by the instructor.
24. Uninteresting course.
25. It was a very worthwhile course.
26. Some things were not explained very well.
27. The way in which this course was taught results in better student
    learning.
28. The course material was too difficult.
29. One of my poorest courses.
30. Material in the course was easy to follow.
31. The instructor seemed to consider teaching as a chore or routine activity.
32. More outside reading is necessary.
33. Course material was poorly organized.
34. Course was not very helpful.
35. It was quite interesting.
36. I think that the course was taught quite well.
37. I would prefer a different method of instruction.
38. The pace of the course was too slow.
39. At times I was confused.
40. Excellent course content.
41. The examinations were too difficult.
42. Generally, the course was well organized.
43. Ideas and concepts were developed too rapidly.
44. The content of the course was too elementary.
45. Some days I was not very interested in this course.
46. It was quite boring.
47. The instructor exhibited professional dignity and bearing in the
    classroom.
48. Another method of instruction should have been employed.
49. The course was quite useful.
50. I would take another course that was taught this way.



Table 2

CEQ Items Grouped by Subscales

01. General Course Attitude (G.C.A.)
     2. It was a waste of time.
     3. Overall, the course was good.
    11. Not much was gained by taking this course.
    20. The course increased my general knowledge.
    25. It was a very worthwhile course.
    29. One of my poorest courses.
    34. Course was not very helpful.
    49. The course was quite useful.

02. Method of Instruction (M.I.)
     1. I learn more when other teaching methods are used.
     6. More courses should be taught this way.
     8. I would have preferred another method of teaching in this course.
    27. The way in which this course was taught results in better student
        learning.
    36. I think that the course was taught quite well.
    37. I would prefer a different method of instruction.
    48. Another method of instruction should have been employed.
    50. I would take another course that was taught this way.

03. Course Content (C.C.)
    13. The course material seemed worthwhile.
    19. The content of the course was good.
    26. Some things were not explained very well.
    28. The course material was too difficult.
    30. Material in the course was easy to follow.
    39. At times I was confused.
    40. Excellent course content.
    44. The content of the course was too elementary.

04. Interest and Attention (I.A.)
     7. The course held my interest.
     9. It was easy to remain attentive.
    14. It was difficult to remain attentive.
    22. Held my attention throughout the course.
    24. Uninteresting course.
    35. It was quite interesting.
    45. Some days I was not very interested in this course.
    46. It was quite boring.

05. Instructor (Instr.)
     5. The instructor seemed to be interested in students as persons.
    10. The instructor did not synthesize, integrate, or summarize
        effectively.
    12. The instructor encouraged the development of new viewpoints and
        appreciations.
    15. Instructor did not review promptly and in such a way that students
        could understand their weaknesses.
    18. The instructor had a thorough knowledge of his subject matter.
    23. The demands of the students were not considered by the instructor.
    31. The instructor seemed to consider teaching as a chore or routine
        activity.
    47. The instructor exhibited professional dignity and bearing in the
        classroom.

06. Specific Item (S.I.)
     4. The textbook was very good.
    16. Homework assignments were helpful in understanding the course.
    17. There was not enough student participation for this type of course.
    21. The types of test questions used were good.
    32. More outside reading is necessary.
    33. Course material was poorly organized.
    38. The pace of the course was too slow.
    41. The examinations were too difficult.
    42. Generally, the course was well organized.
    43. Ideas and concepts were developed too rapidly.




in the physical sciences (Gillmore and Stallings, 1971). Eisenhart (1968)
defined precision as "the typical closeness together of successive
independent measurements of a single magnitude generated by repeated
applications of the process under specified conditions" (p. 1201).

The social and behavioral sciences have a problem within this domain,
however, which physical scientists typically do not share; namely, memory.
A physical scientist can weigh the same block of lead repeatedly, for example,
and obtain a collection or distribution of estimates of its weight. The
precision of the measurements can be assessed in a reasonably straightforward
manner, essentially as a function of the "spread" of the distribution.
However, if one wishes to assess the reliability of the measurement of an
attitude held by an individual, he cannot repeatedly ask the same question
with no intervening time. Human short-term memory is close enough to perfect
to suspect little or no variation in responses, even though the measurement
of the attitude may be potentially quite imprecise.

Two basic approaches are taken to circumvent this problem. First,
multiple independent measurements can be taken simultaneously. This could
be done by having multiple persons rate the same person on a given
attribute or by asking the same person a series of somewhat different
questions which all relate to the same attitude under investigation.
Second, the same measurement can be taken twice, with an intervening time
interval considered, on one hand, long enough to assure that memory of the
first response is not a determiner of the second measurement and, on the
other hand, short enough to assure that the thing being measured has not
changed drastically in the interim.

There are also two basic types of reliability estimates. "A retest
after an interval, using the identical test, indicates how stable scores
are and, therefore, can be called a coefficient of stability. The correlation
between two forms given virtually at the same time is a coefficient of
equivalence showing how nearly two measures of the same trait agree"
(Cronbach, 1951, p. 298).

The majority of this paper will be aimed at the former type of
reliability. Stability is being emphasized for two reasons. The first reason
is neither subtle nor profound. There is no way to assess the equivalency
of a single item. Thus, we are left with the ability to look at equivalency
only in the case of subscales. The second reason relates to purpose. In
the context of the results of an instructor's course evaluation, the
question to which estimates of stability relate is: To what extent are these
results consistent with results which would have been obtained with the
same course taught to an entirely different set of students from the same
population? The question to which estimates of equivalency relate is:
To what extent do the measurements of this aspect of teaching go together?
To put the same question another way: To what extent do the measurements
seem to be assessing a single underlying trait? For this paper, estimates
of stability seem to be the more important of the two. However, data on
the equivalence of the CEQ subscales will be presented and discussed
subsequent to the presentation and discussion of the stability coefficients.

The Estimation of Reliability Coefficients

Generally, reliability coefficients must be estimated from data obtained
from a sample of measurements or measurements from a sample of people. For
purposes of the present paper, they must be estimated because the population
of all instructors has not been measured. Indeed, no one can be sure what
the population really is.

In the ideal situation, one tries to choose a sample completely
randomly from a well-defined population.¹ Often, however, the ideal is not
obtainable simply because of the impossibility or impracticality of
selecting some members of the population for the sample.

¹By randomly, we mean that the method of selection of a sample assures each
member of the population an equal chance to become a member of the sample.

The problem of a non-random sample is especially severe in the present
study. Although Form 66 of the CEQ has had extensive use, especially at the
University of Illinois but also at other institutions, its administration has
never been mandatory. Most of the data used in this study was obtained from
courses whose instructor chose to administer the CEQ.

The raters, i.e., students, who filled out the forms are not random
either. Students do not enroll for courses by random selection. Furthermore,
in some cases, the same students have undoubtedly rated several different
instructors. In other cases, different instructors are rated by completely
independent raters. In a few cases, the same students may have rated the
same instructor more than once. These uncontrolled factors tend to
contaminate the data in largely unknown ways.

The effect of these biases cannot be completely assessed and certainly
not eliminated. However, different methods of estimation can minimize some
of the biases. Using multiple methods also can give some confidence that the
range of reliability coefficients captures the "true" reliability. Finally,
if the varying methods and samples give similar results, more confidence can
be given to the accuracy of those results. It is for these reasons that the
multiple method-multiple sample approach was adopted. In all, six different
stability estimates will be presented for each item and five for each
subscale and the total score. Three equivalence coefficients for each subscale
will also be presented following the presentation of the stability coefficients.

Stability Coefficients for the Items and Subscales
Method 1: The Intraclass Correlation Coefficient

One can look at the set of students who rate each instructor as raters.
Each section which is rated can be looked at as a group. Then, within analysis
of variance language, the situation with many sections being rated is basically
a one-way design with students nested within groups. Since it makes little
sense to generalize all results to the particular set of sections in which
the CEQ has been used, sections should probably be considered as a random
sample of all possible sections, and a random effects or variance
components model should thereby be adopted. (Computationally, in a one-way
design, the fixed and random effects models are identical.)

Using this model, one can partition the total sum of squares from actual
data into that due to groups (sections) and that due to raters within groups.
The reliability of the raters can be estimated from this data by use of the
intraclass correlation. (For derivations of the intraclass correlation, see
Ebel, 1951.) In this case, since differences in level of rating between raters
do make a difference in the evaluation an instructor receives, the "between
raters" variance should be part of the error term. "But, if decisions are
made in practice by comparing single 'raw' scores assigned to different pupils
(instructors) by different raters, or by comparing _averages which come from
different groups of raters_, then the 'between-raters' variance should be
included as part of the error term" (Italics mine) (Ebel, 1951, p. 412). Since
we are assuming completely independent raters for each section, the between
raters variance is, indeed, automatically a part of the error term.

If we were interested in the reliability of the rating of an individual
"average" rater, the formula for the intraclass correlation of relevance would
be as follows:

$$ r = \frac{MSB - MSW}{MSB + (K-1)\,MSW} \qquad (1) $$

where MSB refers to the Mean Square between groups or sections,
MSW refers to the Mean Square between raters within groups,
and K is the number of raters per group.
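
As an illustration of Formula 1, the sketch below computes the two mean squares for a one-way design with raters nested within sections and then the single-rater intraclass correlation. It is a minimal sketch under stated assumptions, not the program used for this report: the function names and the tiny data set in the usage note are hypothetical, and K is simply passed in by the caller.

```python
def mean_squares(groups):
    """Return (MSB, MSW) for a one-way design.

    groups is a list of lists, one list of numeric ratings per section.
    """
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: section means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-groups sum of squares: ratings around their own section mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    msb = ss_between / (n_groups - 1)
    msw = ss_within / (n_total - n_groups)
    return msb, msw


def single_rater_icc(msb, msw, k):
    """Formula 1: reliability of one "average" rater, k raters per group."""
    return (msb - msw) / (msb + (k - 1) * msw)
```

For three hypothetical sections of three raters each, `mean_squares([[4, 3, 4], [2, 1, 2], [3, 3, 4]])` gives MSB = 31/9 and MSW = 1/3, so that `single_rater_icc` with K = 3 returns 28/37, or about .757.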



Gillmore 8c 
l/hen K is not constant, It can be estimated with the following formula: 

1 "'i' 

where n « number of groups and is the number of raters in the i^^ group* 
However, X7e are interested in the reliability of the ratings of the total set 
of raters for each sect Ion, since instructors are evaluated in terms of clas^ 
means rarher than individual ratings. Thus, the appropriate formula becomes 
as follows: 

t IISB - MSW 
^ " MSB 

r' is very close to the value of r inflated by the Spearman-Brown Prophecy 
Formula x^ith the n, which usually refers to the increased length of a test, 
in this case referring to the average number of raters. The Spearman-BroxTn 
Prophecy Formula is as follows: 

(n-irr^l 

The intraclass correlation was computed on three different samples of 
CEQ results.^ 
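
The relation between the class-mean coefficient and the Spearman-Brown formula can be checked numerically with the sketch below. The function names are hypothetical. The function `effective_raters` is an assumption: it uses the standard textbook estimate of the average group size for an unbalanced one-way random effects design and is not claimed to be the formula printed in this report, which cannot be read in the available copy.

```python
def class_mean_reliability(msb, msw):
    """Reliability of a section mean: (MSB - MSW) / MSB."""
    return (msb - msw) / msb


def spearman_brown(r, n):
    """Spearman-Brown Prophecy Formula for n raters with single-rater r."""
    return n * r / ((n - 1) * r + 1)


def f_ratio_from_reliability(r_prime):
    """F = MSB / MSW expressed through the class-mean reliability."""
    return 1 / (1 - r_prime)


def effective_raters(group_sizes):
    """Assumed textbook estimate of K for unequal group sizes."""
    n_groups = len(group_sizes)
    total = sum(group_sizes)
    return (total - sum(k * k for k in group_sizes) / total) / (n_groups - 1)


# Hypothetical mean squares for sections of K = 3 raters each.
msb, msw, k = 31 / 9, 1 / 3, 3
r = (msb - msw) / (msb + (k - 1) * msw)   # Formula 1
r_prime = class_mean_reliability(msb, msw)
```

With these hypothetical values, r is 28/37 and both `r_prime` and `spearman_brown(r, k)` equal 28/31, about .903. When the same K is used in Formula 1 and in the Spearman-Brown formula the two agree exactly, so any difference in practice comes from how the average number of raters is estimated.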

²The F value from the analysis of variance can be computed from the intraclass
correlation as follows: F = 1 / (1 - r'). One can note that if the reliability
is zero, F = 1, its expected value. One can also note that if the reliability
is perfect, F is infinite.

Sample 1

The most alluring aspect of Sample 1 is its immensity. The sample
contains data from the 5,346 sections whose instructors gave the CEQ between
the years 1966 and 1970. Of these sections, 2,782 were taught at the University
of Illinois (Urbana campus). The remaining sections were taught at 18
different colleges and universities across the country. In all, these data are
ratings from 105,576 raters (not necessarily all different).

One assumption of the model which has been adopted is that the sample of
sections is randomly chosen from the entire population. This is, of course,
not true of these data since they were collected from volunteers. Furthermore,
some instructors and courses are represented more than once. This could be
possible with random selection; however, with the non-random method used for
this sample, it probably worsens any bias there might be.

Also, according to the model, raters should be randomly assigned to
sections. Furthermore, no rater should rate more than one course since otherwise
dependencies will be evident in the data, i.e., the correlation built in between
a rater's rating of one course and another. These data do not strictly satisfy
either requirement. It is not always clear why students choose the courses they
do, but it is clearly not random. Furthermore, there is little doubt that
some students rated more than one course within the sample.

Given these obvious limitations of the data, even the most imperceptive
of readers might reasonably inquire as to why analysis was carried out at all.
Beyond the natural passion of statisticians for large amounts of data, the
biases may not be as destructive as one might at first suspect. First, the
greatest effect of the volunteer nature of the sample would probably be to
restrict the range of responses. For example, one might expect really poor
teachers to tend not to give the instrument, and thus, one tail of the
distribution would be smaller in the sample than in the population. It is
difficult for this researcher to conceive of a reason to expect the bias
of the sample to increase the range unless one suspects that good and bad
instructors tend to give the instrument, while mediocre instructors tend
not to. However, since the distribution of the data so closely approximates the
normal distribution (see Gillmore, 1971), this does not look likely. The result
of restriction of range is characteristically to lower reliabilities (more
basically, correlation). In the present context, the restriction of range
clearly lowers the mean square between groups, but probably does not affect
the within group mean square. Thus, the effect of this bias is probably to
lower the reliability estimates rather than to inflate them.
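
The direction of this effect follows directly from the class-mean formula, since (MSB - MSW) / MSB = 1 - MSW / MSB. The short sketch below, with purely hypothetical mean squares, holds the within-group mean square fixed and shrinks the between-group mean square, as a restriction of range would.

```python
def class_mean_reliability(msb, msw):
    """Reliability of a section mean: (MSB - MSW) / MSB."""
    return (msb - msw) / msb


def reliabilities_for(msb_values, msw):
    """Reliability at each between-group mean square, with MSW held fixed."""
    return [class_mean_reliability(msb, msw) for msb in msb_values]


# Hypothetical values: MSW fixed at 0.5 while MSB shrinks from 5.0 to 1.0.
estimates = reliabilities_for([5.0, 2.5, 1.0], msw=0.5)
```

Here `estimates` falls from .9 to .8 to .5, so a sample with a narrower spread of section means yields a lower coefficient even though the raters within sections are no less consistent.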

Similarly, the effect of having an instructor or course rated more than
once, but treating the data as if it were independent ratings, would seem to
have a depressing effect on the reliability estimates. If two independent sets
of raters rate the same instructor or course, the resulting ratings would
certainly tend to be closer together than if two independent sets of raters
rated two different instructors or courses. This again would suggest a reduction
of the between groups sum of squares without an accompanying reduction of within
group sum of squares.

Finally, having some of the same students rate two different courses or
instructors would also seem not to have much effect. Consider
the case where every student rates every instructor or course. This is a
completely crossed two-way analysis of variance design with students as a random
factor and instructors either random or fixed. In this case, the total sum of
squares is partitioned into that due to instructors, to students, and to the
instructor by student interaction. The typical error term for instructors is
the latter. However, for the appropriate intraclass coefficient, the between
raters variance is also part of the error. Thus, the two analyses would seem
to be equivalent.

The last bias mentioned above was the lack of random assignment of raters
to instructors. It might be pointed out that random assignment would make
no sense in the educational setting in which we are working, but that is only
saying that the model does not fit. Unfortunately, the effect of this lack
of fit is not clear, nor can this researcher make any very intelligent guesses.



All of these influences in combination would seem to give some confidence
that empirically determined reliability estimates by the method described above
would not tend to be overestimates; indeed they might be more aptly considered
underestimates.

The resulting stability coefficients for Sample 1 are found in Table 3.
Because of the large amount of data, a completely accurate computation of the
mean square between and mean square within for Formula 2 was not feasible. The
average section size was used to compute the mean squares rather than using the
correct weighted average. Also, the stability estimates for subscales and the
total score were not calculated for this sample. They were calculated for all
other samples. One can note that the coefficients for items range from .756
for Item 32 to .911 for Item 4. The mean of the 50 reliability estimates is
.854.

Samples 2 and 3

One of the problems inherent in Sample 1 was that instructors could appear
more than once in the sample. Similarly, courses could also appear more than
once. To alleviate this problem, a sample was randomly chosen from University
of Illinois (Urbana campus) courses taught fall term, 1971-72, that used the
CEQ. The sample contained 200 courses; however, no course or instructor was
allowed to appear in the sample more than once. A second sample of identical
size was chosen for purposes of replication. The only additional criterion
for exclusion was that no particular section could appear in the second sample
which had appeared in the first. Thus, Samples 2 and 3 were independent random
samples within the limits mentioned above. (The non-representativeness of the
population from which the sample was drawn still remains a shortcoming.) The
average number of students per class was 34.92 for Sample 2 and 28.15 for
Sample 3.


Table 3

Reliability (Stability) Estimates for the Items, Subscales, and
Total Scores of the CEQ by Three Different Methods and Six Different Samples*

          Method 1 - Intraclass              Method 2       Method 3 - Split-half
                                            Test-Retest
Item    Sample 1   Sample 2   Sample 3       Sample 4       Sample 5   Sample 6
        (N=5346)   (N=200)    (N=200)     (N=103 Pairs)     (N=103)    (N=103)

   1       834        868        849           652          880        806
   2       846        879        889           595          853        828
   3       864        911        905            --          850        877
   4       911        932        899           711          911        851
   5       883        910        910           817           --        878
   6       881        927        904            --          895        847
   7       870        907        905            --          859        856
   8       853        892        874            --          837        789
   9       880        927        909            --          901        873
  10       835        908        878            --           --        842
  11       843        870        887            --          844        824
  12       864        913        907            --          754        813
  13       837        874        863            --          749        789
  14       874        920        906           662          869        843
  15       827        892        876            --          799        812
  16       877        882        866           720          793        862
  17       874        898        852            --          721        791
  18       852        913        888           747          828        839
  19       842        888        885           599          775        849
  20       819        852        845           690          797        733
  21       885        884        876           621          850        824
  22       876        914        902            --          885        871
  23       828        877        871           670          845        747
  24       860        901        908           661          850        858
  25       869        895        904           654          819        867
  26       864        924        887           736          862        905
  27       877        922        896           691          867        845
  28       831        872        852           582          731        730
  29       834        896        897           652          771        830
  30       864        909        890           711          812        858
  31       846        888        892            --          750        863
  32       756        845        731           533          686        669
  33       839        889        856           717          817        849
  34       845        875        874           660           --         --
  35       868        917        914           696          863        905
  36       886        924        915           758          881        887
  37       855        896        865           738          830        830
  38       804        815        808           725          801        828
  39       885        924        890           783          889        885
  40       867        916        903           699          825        830
  41       908        901        895           725          847        824
  42       843        887        870           639          821        833
  43       839        901        866           689          781        841
  44       810        795        731           548          543        686
  45       855        888        850           576          826        757
  46       855        895        899           685          824        854
  47       817        854        812           576          730        751
  48       854        896        869           713          829        838
  49       859        889        885           669          817        839
  50       869        905        897           671          861        834

Subscales

G.C.A.                916        918           698          875        893
M.I.                  931        914           733          899        871
C.C.                  940        920           725          875        886
I.A.                  933        926           704          900        893
Instr.                938        931           730          864        883
S.I.                  923        895           697          867        903
Total                 945        932           728          900        906

*Decimal points have been eliminated for ease of reading in this table and all
subsequent tables. Entries shown as -- are missing or illegible in the
available copy.



The intraclass reliability coefficients calculated on these two samples
appear in Table 3. The range of estimates for the items for Sample 2 was from
.795 for Item 44 to .905 for Item 50. The range for the items for Sample 3 was
from .731 for Items 32 and 44 to .897 for Item 50. The average stability
coefficient for Sample 2 was .893; the average for Sample 3 was .876.

Also found in Table 3 are the results for the subscales and the total scores.
In this case and subsequently, the stability coefficients are computed on the
subscale means and the means of the total scale.

Method 2: Test-Retest

The second method of reliability estimation was similar to the test-retest
method. Usually, in test-retest, the instrument is administered twice to the
same group of subjects, with a period of time intervening judged to be long
enough that subjects have forgotten specific responses they had made but short
enough that true changes in the variable being measured would not be expected.

Since the same set of students do not take the same course by the same
instructor twice (except possibly the very small subset who fail the first time),
the traditional test-retest method was not possible to implement. A variant of
this method was possible, however.

Frequently, the same instructor may teach two or more sections of the
same course. This procedure is most common in lower level courses. If the
same instructor teaches two sections of the same class, and uses the CEQ in both,
one would expect that the two sets of ratings would be much more similar than,
say, two different instructors teaching two different courses; that is, if the
instrument is reliable. If, on the other hand, the ratings of the same
instructor teaching two sections of the same course were not similar, the
reliability of the instrument would be highly questionable. The correlation of
the means of the two sets of ratings, correlated over instructors who taught two
sections of the same course, can be considered a reliability estimate.



There is a problem with this estimate of reliability, however. As with
twin studies, it is not clear which mean of an instructor goes into which of
the two groups. In the traditional test-retest method, the outcome of the first
testing is correlated with the outcome of the second testing. However, it makes
little sense to correlate the means of the first section that meets in a day
with the means of the second section that meets in a day. On the other hand,
random placement of sections into groups results in a variability of resulting
correlations which need not be present. In the worst possible case, a researcher
could do much to influence the size of a correlation by post hoc arrangement.

The proper way to alleviate this problem is to create a symmetric table
(Treloar, 1942, pp. 11-13). For this method, each pair of observations is
entered into the data matrix twice, once in each order. The correlations which
result from this method are reliability coefficients (Jensen, 1971). They are
a form of the intraclass correlation. However, they should be considered as
"lower bounds" to a test-retest reliability. We can be reasonably confident
that the "true" test-retest correlation is not lower than these coefficients.
There are two reasons for this statement. First, correlations computed from
a symmetric matrix are smaller than correlations resulting from any combination
of the same values in a non-symmetric matrix. Second, and probably more important,
the comparisons are made between pairs of sections which differ in many important
ways. The most important is different sets of raters. Other differences are
time of day, size of the class, fatigue or practice effects on the instructor,
etc. Thus, these stability coefficients are expected to be smaller than those
presented previously.
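
The double-entry procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the computation actually used for the report: the function names and the section means are hypothetical, and an ordinary product-moment correlation is assumed.

```python
from math import sqrt


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def double_entry_correlation(pairs):
    """Correlation from a symmetric table: every pair is entered in both orders."""
    first = [a for a, b in pairs] + [b for a, b in pairs]
    second = [b for a, b in pairs] + [a for a, b in pairs]
    return pearson(first, second)


# Hypothetical item means for two sections taught by each of four instructors.
section_means = [(3.1, 3.4), (2.5, 2.2), (4.0, 3.7), (3.3, 3.6)]
# The same data with two of the pairs entered in the opposite order.
swapped = [(3.4, 3.1), (2.5, 2.2), (3.7, 4.0), (3.3, 3.6)]
```

Because both orders of every pair are entered, `double_entry_correlation(section_means)` and `double_entry_correlation(swapped)` agree, so the result does not depend on which section is called "first."
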
Sample 4 

During the academic year 1970-71, a large proportion of instructors at
the University of North Carolina, Greensboro (U.N.C.), used the CEQ. As a
by-product, many instructors who taught two or more sections of the same course
used the CEQ for these sections. In all, 103 different instructors (and
103 different courses) fell within this category. When an instructor taught
more than two sections of the same course, two of the sections were randomly
chosen. Thus, the reliability was calculated, using a symmetric matrix, for
103 instructors and courses, with two sections of each.

The results of applying Method 2 to Sample 4 can be found in Table
3 for all items. The reliabilities range from .533 for Item 32 to
.817 for Item 5. The average reliability over the 50 items was .671. The
results for the six subscales are also found in Table 3, as is the result for
the total instrument.
Method 3: Split-half 

The split-half method of calculating reliability is most commonly applied
to situations in which a group of subjects respond to an instrument with many
items. These items are typically attempting to measure a construct, such as
knowledge in an achievement test, or a specific attitude in an attitude
questionnaire. The items are split into two groups; usually the odd numbered
items go into one group, the even numbered items go into the other. The
means of the two groups are then correlated over subjects. Finally, the
resulting correlation is raised by use of the Spearman-Brown Prophecy Formula
for tests of double length,³ which is then a measure of equivalency of the set
of items.



³ The Spearman-Brown Prophecy Formula for tests of double length is as follows:

$$ r_{tt} = \frac{2 r_{11}}{1 + r_{11}} $$

where $r_{11}$ is the correlation between the two halves, and $r_{tt}$ is the total
reliability.




This same procedure can be followed for a single item with multiple
raters. Raters can be split into two groups, and the average of the two
groups correlated over sections. This result also needs to be raised by the
Spearman-Brown Formula to become a reliability estimate. However, the result
of applying the split-half method to raters is a stability coefficient rather
than an equivalence coefficient.
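
The split-half procedure applied to raters can be sketched as below. This is a minimal illustration rather than the report's own program: the function names and the ratings are hypothetical, an odd-even split in the natural order of the data is assumed, and every section is assumed to have at least two raters.

```python
from math import sqrt


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown_double(r_half):
    """Spearman-Brown Prophecy Formula for a test of double length."""
    return 2 * r_half / (1 + r_half)


def split_half_rater_reliability(sections):
    """sections: one list of ratings on a single item per section.

    Raters are split odd-even in their natural order, the two half-means
    are correlated over sections, and the result is stepped up.
    """
    odd_means = []
    even_means = []
    for ratings in sections:
        odd = ratings[0::2]   # 1st, 3rd, 5th, ... rater
        even = ratings[1::2]  # 2nd, 4th, 6th, ... rater
        odd_means.append(sum(odd) / len(odd))
        even_means.append(sum(even) / len(even))
    return spearman_brown_double(pearson(odd_means, even_means))
```

For example, a half-correlation of .50 is stepped up to `spearman_brown_double(0.5)`, i.e. two-thirds.
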
Sample 5 and 6 

The 103 pairs of sections used in Sample 4 also comprised Samples 5 and 6.
One member of each pair was arbitrarily assigned to Sample 5, the other to
Sample 6. Then within the two samples, the raters from each section were
divided into two groups by means of an odd-even split based on the way the data
were naturally ordered. Then, the means of the odd raters were correlated with
the means of the even raters for each item within each course. Finally, the
resulting correlations were corrected by means of the Spearman-Brown Prophecy
Formula. The results are found in Table 3. For Sample 5, the stability
coefficients for items ranged from .543 for Item 44 to .911 for Item 4. For
Sample 6, the stability coefficients for items ranged from .669 for Item 32 to
.905 for Items 26 and 35. The averages were .819 and .829 for Sample 5 and
Sample 6, respectively.

Consistency of Stability Estimates 

At this point, two considerations seem warranted. First, is there any
consistency among the reliability estimates? Essentially, we are asking a
seldom asked question: Are the stability coefficients reliable? If they are
not, we would be hard-pressed to justify faith in their veracity. Secondly,
given an adequate degree of consistency, what do they indicate? Or, to put
this question into an over-simplified form: Are the items of the CEQ reliable?




To address the "reliability of the reliabilities" question, the six
different reliability estimates were intercorrelated over the fifty items.
(The reader should note that this is not a recommended procedure and the
results should be interpreted with some caution.) The results are presented
in Table 4. Subscales were not included in this analysis because of the small
amount of variation in their stability estimates within methods. As can be
seen, the correlations among the various techniques are reasonably large. The
"test-retest" method tended to show the least agreement with the other methods,
but even these correlations were .499 and above.

As an overall assessment of the consistency of the reliability estimates,
the average off-diagonal correlation was calculated (.672), and corrected by
the Spearman-Brown Prophecy Formula using a test of length six times the original.
The resulting value, which is an estimate of the alpha coefficient, was .925.⁴
We take this to suggest a consistency high enough to allow faith in our results.
However, there still may be a source of systematic error which could affect all
methods similarly and, therefore, not show up in the intercorrelations. An
example of one such source is the volunteer nature of all the data.

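
The arithmetic behind the .672 and .925 figures can be checked with a short sketch. The fifteen values are the off-diagonal entries of Table 4, entered here on the assumption that the three-digit table entries are correlations printed without decimal points; the variable and function names are illustrative only.

```python
# Off-diagonal entries of Table 4 (decimal points restored by assumption).
off_diagonal = [
    0.765,
    0.765, 0.840,
    0.555, 0.517, 0.499,
    0.632, 0.698, 0.751, 0.525,
    0.653, 0.697, 0.807, 0.665, 0.710,
]


def spearman_brown(r_mean, n):
    """Spearman-Brown Prophecy Formula for a test n times the original length."""
    return n * r_mean / ((n - 1) * r_mean + 1)


mean_r = sum(off_diagonal) / len(off_diagonal)
consistency = spearman_brown(mean_r, 6)
print(round(mean_r, 3), round(consistency, 3))
```
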
In evaluating the magnitude of the reported reliability estimates, it must
be noted that there are degrees of reliability, and while zero reliability
(i.e., complete unreliability) makes measurements unsuitable for any use,
varying degrees of reliability are adequate for different situations, depending
essentially on the fineness of discrimination needed, the supplemental data
available, and the importance of resulting decisions. Thus, in reality, one
cannot evaluate the stability coefficients in any abstract sense. However,
in considering the practical meaning of the various indices of reliability



⁴ The alpha coefficient is a measure of equivalence which will be discussed
in the subsequent section.



Table 4

The Correlations Among Reliability Estimates

               Method 1        Method 2    Method 3
Samples       1       2       3       4       5       6

      1    1000
      2     765    1000
      3     765     840    1000
      4     555     517     499    1000
      5     632     698     751     525    1000
      6     653     697     807     665     710    1000



which have been described, another almost paradoxical issue comes to the fore;
that of class size. 

Typically in studies of reliability, numbers of subjects have little
effect. This is because, in general, the magnitude of a correlation coefficient
is unaffected by sample size (other than the fact that there is more variability
associated with smaller sample sizes). However, the length of the measuring
device has an effect. Generally, reliability increases with increased length,
other things being equal. However, since in the present situation we use raters
analogously to items, class size does make a difference in the reliability of
the class mean on a particular item. The reliability of a mean rating on an
item in a class of 100 will generally be greater than for a class of 10, though
not ten times greater. So what has been presented above are stability coefficients
for average size classes. To get a fix on what effect the class size
variable has, we can go back to the intraclass reliability coefficient for an
individual rater. We present these values as calculated from Samples 1, 2, and
3 in Table 5. In all three samples, the smallest intraclass value, rounding
off to two places, is .09 for Items 32 and 44 in Sample 3. The largest value
is .33 for Item 41 in Sample 1. The average value is about .21.

These values essentially represent what the reliability of an individual
rater would be. To continue our analogy with items, these reliabilities are
comparable to the reliability of a single item, or the average off-diagonal
correlation among items. And, just as a reliability for a set of items can be
computed by applying the Spearman-Brown Prophecy Formula to average off-diagonal
interitem correlations and the number of items,⁵ so can the reliability of



⁵ The formula is:

$$ \text{Reliability} = \frac{n r}{(n - 1) r + 1} $$

where n = number of items and
r = average off-diagonal correlation.



Table 5

Intraclass Correlations for Individual Raters
Computed on Sample 1, 2, and 3

Item         Sample 1    Sample 2    Sample 3

  1             202         155         161
  2             217         168         216
  3             243         222         246
  4             342         276         232
  5             276         219         258
  6             272         262         244
  7             253         214         246
  8             227         187         192
  9             270         260         255
 10             205         215         197
 11             214         157         211
 12             244         226         250
 13             206         162         178
 14             259         242         249
 15             194         188         194
 16             265         173         182
 17             259         197         164
 18             225         226         215
 19             212         181         209
 20             186         138         158
 21             280         175         195
 22             264         229         241
 23             196         165         188
 24             202         252         237
 25             192         244         252
 26             ...         213         ...
 27             248         229         266
 28             160         165         199
 29             194         230         202
 30             218         217         244
 31             181         221         217
 32             132         085         136
 33             183         170         208
 34             163         192         216
 35             235         268         250
 36             254         269         282
 37             194         180         231
 38             109         126         172
 39             252         217         280
 40             234         243         247
 41             202         227         334
 42             179         186         214
 43             203         182         209
 44             097         085         178
 45             181         163         229
 46             193         235         231
 47             140         129         184
 48             194         186         229
 49             182         209         235
 50             209         231         251

Subscales

 G.C.A.         ...         233         279
 M.I.           ...         272         268
 C.C.           ...         303         282
 I.A.           ...         281         301
 Instr.         ...         296         317
 S.I.           ...         250         226
 Total          ...         322         320




ratings be assessed by applying the Spearman-Brown Prophecy Formula to the intraclass
correlation for an individual rater and the total number of raters. By this
means, we can get an idea of what the reliability for a given item will be
within a class of a certain size.

To get an idea of the effect of class size, the results of applying
the Spearman-Brown Prophecy Formula are plotted in Figure 1 as a function of
class sizes 1-40, for the high, low, and average intraclass correlations
mentioned above. As can be seen, a reliability of .80 is achieved for even the
lowest intraclass correlation with a class of 40 or above. For the average
value, the .80 magnitude is reached by class size of 15. Finally, for the
highest intraclass correlation, a class size of 11 is sufficient to reach .80.
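
The class-size relationship can be tabulated with a few lines of code. This is a hedged sketch, not a reconstruction of Figure 1: it applies the Spearman-Brown Prophecy Formula of footnote 5 with raters in place of items, uses the low, average, and high single-rater values quoted in the text (.09, .21, .33), and the function names and the class sizes printed are arbitrary choices.

```python
def spearman_brown(r_single, n_raters):
    """Projected reliability of a class mean based on n_raters ratings."""
    return n_raters * r_single / ((n_raters - 1) * r_single + 1)


def reliability_by_class_size(r_single, sizes):
    """Return (class size, projected reliability) pairs."""
    return [(n, spearman_brown(r_single, n)) for n in sizes]


if __name__ == "__main__":
    # Low, average, and high single-rater intraclass values from the text.
    for label, r in (("low", 0.09), ("average", 0.21), ("high", 0.33)):
        for n, rel in reliability_by_class_size(r, (5, 10, 15, 20, 40)):
            print(f"{label:8s} n={n:3d} reliability={rel:.3f}")
```

With the average value of .21, a class of 15 comes out at about .80, and with the lowest value of .09 a class of 40 does the same, in line with the figures given above.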

In most contexts, a stability coefficient for a single item of .80 would
seem to be sufficient. Indeed, .70 may very well be an acceptable figure.
If so, it is obtained for all but the smallest class for all but the lowest
intraclass correlations. In general, it seems fair to say that the items of
the CEQ have adequate stability, although one should realize that as class
size decreases, interpretations may become more tenuous.

The same general statement can be made for the subscales. However, the
stability coefficients for subscales definitely tend to be as high or higher
than any of those for the items, as would be expected. Thus, somewhat more
confidence can be lent to the results of subscales than items.

Equivalence Coefficients for the Subscales 

Subscales by their very nature are made up of collections of items. 
Subscale scores, or subscores, are calculated by summing or averaging the 
scores of the items contained in the subscore. Thus, it is reasonable to 
question the relatedness, or equivalence, of the member items. For example, 
if they all measure completely different attributes, the single number 
representing the subscore has little meaning. 




Coefficient alpha is a reliability estimate assessing the equivalence
of a set of items (Cronbach, 1951). The formula for alpha is as follows:

$$ \alpha = \frac{k}{k - 1} \left( 1 - \frac{\sum_{i} \sigma_i^2}{\sigma_t^2} \right) $$

where k is the number of items,
$\sigma_i^2$ is the variance of the i-th item, and
$\sigma_t^2$ is the variance of the sum of the items.
Cronbach (1951) has shown that coefficient alpha is the average of all
possible split-half reliability coefficients. Furthermore, in the case where
all items have equal variances, coefficient alpha is exactly equal to computing
the average off-diagonal correlation among items and entering it into the
Spearman-Brown Prophecy Formula (see footnote 5), where n is the number of
items.
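
The equal-variance case can be checked numerically. The sketch below assumes the usual variance form of coefficient alpha (item variances over the variance of the summed score); the data are hypothetical section means on two items chosen so that both items have the same variance, and the function names are illustrative only.

```python
def variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def coefficient_alpha(rows):
    """rows: one list of k item scores per unit (here, per section)."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def spearman_brown(r_mean, n):
    """Spearman-Brown Prophecy Formula for n items with mean correlation r_mean."""
    return n * r_mean / ((n - 1) * r_mean + 1)


# Hypothetical data: two items, each with variance 1.0 over three sections;
# the two items correlate .50 over the sections.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
alpha = coefficient_alpha(rows)
stepped_up = spearman_brown(0.5, 2)
```

Here `alpha` and `stepped_up` both come out at two-thirds, which is the equality stated above for items of equal variance.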

Coefficient alpha was computed for all subscales for Samples 1, 2, and 3.
The results are found in Table 6. Since the unit of analysis is sections, in
all cases section means rather than individual student ratings were entered into
this analysis.

The equivalence coefficients for all three samples are very consistent
within subscales. Three subscales, General Course Attitude, Method of Instruction,
and Interest and Attention, show very high reliability. The reliabilities
of Course Content and Instructor are moderately high. Finally, Specific Items
has a considerably lower reliability than the others.

If one carefully reads the items which form each subscale (Table 2), he
can see reasons for the existence of these differences. The three subscales
which show extreme reliabilities all contain nearly equivalent items. Differences
among the items are mainly subtle wording changes, i.e., synonyms. The
Instructor subscale, on the other hand, contains somewhat dissimilar questions,



Table 6

Correlations Among the Items of the
General Course Attitude Subscale*

Items      2      3     11     20     25     29     34

Sample 1
    3    882
   11    905    383
   20    783    770    819
   25    871    895    911    831
   29    752    786    770    706    780
   34    869    867    907    831    914    786
   49    850    863    897    830    931    777    919

Sample 2
    3    926
   11    948    938
   20    863    879    892
   25    922    928    945    902
   29    914    941    923    854    919
   34    928    928    950    896    951    934
   49    907    904    935    891    952    905    952

Sample 3
    3    931
   11    926    876
   20    864    843    869
   25    925    925    921    884
   29    879    911    353    806    882
   34    905    887    928    892    929    866
   49    910    899    917    898    961    864    938

*Items 2, 11, 29, and 34 have been reverse scored such that a high score
for all items represents a favorable response.



although all relating to the instructor of the course. The Course Content subscale
has clusters of meaning. Items 13, 19, and 40 are very similar, all dealing
with the value of course material. Items 28, 30, 39, and 44 all deal with
the relative difficulty of the course material. Finally, Item 26, and to
some extent 39, deals with the instructor's explanation of the course material.
Thus, although all items relate to the course content, they are not all homogeneous
in content; hence, the lower reliability.

The Specific Items subscale is an even more extreme case of non-homogeneous 
item content. A whole collection of course related topics is included in the 
items of the subscale and, therefore, less equivalence is the result. Indeed, 
this collection of items stretches the definition of a subscale. 

To get an idea of the relationship among items, the correlations among
items within subscales are presented in Tables 7 through 12. Correlation
matrices from all three samples are included. Again, these correlations are
computed over section means. The preceding comments concerning the content
of the items are generally borne out by the structure of these matrices. One
can note consistently high correlations among the items of the General Course
Attitude, Method of Instruction, and Interest and Attention subscales. Correlations
among the Instructor items are consistent but lower. The Course Content
subscales show definite clustering as suggested above. Finally, the Specific
Items subscale contains generally low correlations, but some clustering can
be seen.

Conclusion - The Standard Error of Measurement

This report will conclude with a brief discussion of the meaning of the
reliability estimates which have been presented in this paper in terms of the
Standard Error of Measurement (S.E.M.).

S.E.M. is essentially the standard deviation of a hypothetical distribution
of observed scores around a "true" score; essentially, a distribution of



Table 7

Correlations Among the Items of the
Method of Instruction Subscale*

Items      1      6      8     27     36     37     48

Sample 1
    6    832
    8    892    887
   27    848    915    894
   36    819    860    854    871
   37    882    885    925    900    883
   48    885    895    931    910    879    942
   50    802    909    851    883    835    859    877

Sample 2
    6    899
    8    911    934
   27    870    945    924
   36    867    912    896    909
   37    902    931    958    933    898
   48    911    930    947    915    899    946
   50    883    942    923    938    902    922    928

Sample 3
    6    901
    8    925    939
   27    887    946    919
   36    901    909    901    902
   37    931    936    956    927    921
   48    928    917    946    905    908    951
   50    887    935    905    932    899    907    895

*Items 1, 8, 37, and 48 have been reverse scored such that a high score
for all items represents a favorable response.



Table 8

Correlations Among the Items of the
Course Content Subscale*

Items     13     19     26     28     30     39     40

Sample 1
   19    803
   26    514    588
   28    446    444    525
   30    452    484    631    800
   39    327    366    644    711    809
   40    873    900    627    448    506    394
   44    438    459    190   -188   -210   -248    441

Sample 2
   19    916
   26    595    682
   28    473    495    605
   30    520    570    767    780
   39    467    512    763    755    864
   40    914    943    713    507    615    546
   44    474    490    195   -108   -122   -094    468

Sample 3
   19    924
   26    603    694
   28    611    612    673
   30    582    621    784    836
   39    508    557    778    806    904
   40    896    928    711    601    638    573
   44    371    393    229   -050   -096   -138    415

*Items 26, 28, 39, and 44 have been reverse scored such that a high score
for all items represents a favorable response.



Table 9

Correlations Among the Items of the
Interest and Attention Subscale*

Items      7      9     14     22     24     35     45

Sample 1
    9    908
   14    908    959
   22    929    ...    939
   24    917    842    859    878
   35    932    869    876    ...    931
   45    853    852    859    875    816    841
   46    915    883    895    901    920    910    841

Sample 2
    9    950
   14    940    974
   22    959    964    960
   24    953    918    919    923
   35    970    923    911    929    950
   45    879    900    896    904    844    867
   46    963    944    950    943    943    948    885

Sample 3
    9    930
   14    918    970
   22    944    960    945
   24    942    893    889    905
   35    956    918    908    943    949
   45    876    887    877    916    849    877
   46    937    923    921    938    939    939    882

*Items 14, 24, 45, and 46 have been reverse scored such that a high score
for all items represents a favorable response.



Table 10

Correlations Among the Items of the
Instructor Subscale*

Items      5     10     12     15     18     23     31

Sample 1
   10    570
   12    706    546
   15    645    788    556
   18    411    589    458    462
   23    771    632    696    700    445
   31    760    668    685    641    596    713
   47    485    540    478    457    607    491    633

Sample 2
   10    639
   12    725    591
   15    678    833    651
   18    448    626    459    531
   23    770    681    807    752    478
   31    804    705    749    745    611    769
   47    471    627    476    531    693    485    629

Sample 3
   10    605
   12    670    557
   15    705    845    506
   18    423    647    406    545
   23    791    621    706    707    420
   31    797    733    756    698    630    797
   47    418    586    425    437    707    411    596

*Items 10, 15, 23, and 31 have been reverse scored such that a high score
for all items represents a favorable response.



Table 11

Correlations Among the Items of the
Specific Items Subscale*

Items      4     16     17     21     32     33     38     41     42

Sample 1
   16    ...
   17    109    ...
   21    ...    ...    ...
   32    ...    ...    ...    ...
   33    347    371    378    491    287
   38    173    306    195    214    279    393
   41    165    231    321    594    044    339   -069
   42    330    398    400    509    257    910    417    329
   43    ...    ...    458    349    ...    ...    ...    539    457

Sample 2
   16    ...
   17    ...    ...
   21    ...    ...    404
   32    ...    ...    042    204
   33    ...    407    453    549    356
   38    246    319    ...    310    322    423
   41    282    253    498    677    132    445    111
   42    441    398    508    564    287    935    416    451
   43    356    109    496    417    248    489   -079    617    491

Sample 3
   16    279
   17    250    376
   21    398    500    509
   32    341    131    184    206
   33    514    474    445    633    346
   38    240    325    251    222    152    501
   41    308    370    547    764    174    455   -050
   42    523    497    461    646    339    937    474    468
   43    366    279    537    631    297    518   -152    793    516

*Items 17, 32, 33, 38, 41, and 43 have been reverse scored such that a high
score for all items represents a favorable response.






Table 12

Standardized Cutoff Scores for Decile Norms

Decile      Standardized
            Cutoff Score

   0
               -1.28
   1
               -.84
   2
               -.525
   3
               -.255
   4
                0
   5
               +.255
   6
               +.525
   7
               +.84
   8
               +1.28
   9





measurement errors. A "true" score is an abstract quantity defined in various
ways but essentially indicating what the score of an entity on a given variable
really is. True scores are, of course, not directly measurable. If they were,
there would be no need to estimate reliability coefficients.

The S.E.M. is important in that it gives some notion as to how close
observed scores are likely to be to the "true" score.⁶ For example, if one is
making differential decisions on the basis of individual scores, one would hope
for a small S.E.M. As the S.E.M. increases, his decisions are more apt to be
due to measurement error than real differences.

The formula for S.E.M. is as follows:

$$ \text{S.E.M.} = S \sqrt{1 - r} \qquad (4) $$

where S is the standard deviation of the observed scores and
r is the reliability.

The theoretical distribution of observed scores tends to be normal, with a
mean at the true score and a standard deviation equal to the S.E.M. Consequently,
about 68 percent of the observed scores will be within one S.E.M. of the true
score in either direction. About 95 percent of the observed scores will be within
two S.E.M.'s in either direction.

All of the S.E.M.'s for the various reliability estimates for items and
subscales will not be presented. Rather, some general notions will be suggested
in the context of norming, since S.E.M.'s are probably most important when
considering comparative judgments.

Currently, the CEQ norms are presented in the form of deciles. Instructors
can get a decile rank of zero through nine, with zero designating the lowest ten
percent of the distribution of previous CEQ users, one designating from the



⁶ It is important to note that an S.E.M. does not indicate how close a specific
observed score is to the true score, since that observed score could lie anywhere
in the distribution.



tenth to twentieth percentiles, etc. The CEQ deciles are computed by use of 
normal approximations (Gillmore, 1972). 

Decile cutoff scores can be standardized, i.e., converted to z scores.
Furthermore, since standard scores have a mean of zero and a standard deviation
of one, the formula for S.E.M. (Formula 4) for standardized variables becomes:

$$ \text{S.E.M.} = \sqrt{1 - r} $$

The magnitude of the S.E.M. for various size reliabilities is directly
comparable to the decile cutoff scores. Thus, in Table 12, standardized cutoff
scores for the deciles are presented. In Figure 2, the graph of standardized
S.E.M.'s is presented as a function of reliability. For any reliability, the
standardized S.E.M. can be determined. Then, one can assess, for any standardized
true score, how wide a plus or minus one S.E.M. interval is in deciles.
He can also determine the interval for plus or minus two deciles, etc.

For example, from the graph, it can be seen that a reliability of .90 has
a standardized S.E.M. of .31. Thus, an interval of plus or minus one S.E.M.
is .62. An interval of plus or minus two S.E.M.'s is 1.24, etc. These values
can be used in conjunction with Table 12. If an instructor's true score were
in the center of the fifth decile, a standard score of .13, 68 percent of his
observed scores would be from the standard score of -.19 (.13 - .32) to +.45 (.13
+ .32), which is within the fourth decile to the sixth decile. Similarly, 95 percent
of his observed scores would be from the third to the seventh decile.
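
The worked example can be retraced in code. This is a sketch under stated assumptions: the cutoffs are the standardized values of Table 12, each cutoff is taken as the upper boundary of the decile listed above it, a score falling exactly on a cutoff is assigned to the higher decile, and the function names are hypothetical. Computing the square root directly gives a standardized S.E.M. of about .32 for a reliability of .90, close to the value read from the graph.

```python
from bisect import bisect_right
from math import sqrt

# Standardized cutoff scores from Table 12 (boundaries between deciles 0-9).
CUTOFFS = [-1.28, -0.84, -0.525, -0.255, 0.0, 0.255, 0.525, 0.84, 1.28]


def standardized_sem(reliability):
    """S.E.M. of a standardized score: square root of (1 - r)."""
    return sqrt(1.0 - reliability)


def decile(z):
    """Decile rank (0-9) of a standard score; ties go to the higher decile."""
    return bisect_right(CUTOFFS, z)


def decile_band(true_z, reliability, n_sem=1):
    """Deciles spanned by true_z plus or minus n_sem standard errors."""
    half_width = n_sem * standardized_sem(reliability)
    return decile(true_z - half_width), decile(true_z + half_width)


if __name__ == "__main__":
    print(decile_band(0.13, 0.90, n_sem=1))  # about 68 percent of observed scores
    print(decile_band(0.13, 0.90, n_sem=2))  # about 95 percent of observed scores
```

For a true score of .13 and a reliability of .90, the one-S.E.M. band runs from the fourth to the sixth decile and the two-S.E.M. band from the third to the seventh, matching the example in the text.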

In like manner, the reader can make his determinations for any size reliability
and any true score. The reliability is a function, of course, of the
item in which he is interested and the number of raters.

In the case of subscales, the reliability can be an equivalence coefficient 
or a stability coefficient. The one which is used depends upon the question 
being asked. If one is concerned about how stable his ratings are likely to 



be, i.e., to what extent they would be expected to vary from class to class,
then he should determine his S.E.M. from stability coefficients. In the context
of evaluation, this would seem to be the proper coefficient to use.

On the other hand, if the instructor is interested in how precisely a
given attribute of teaching is measured, e.g., Interest and Attention, then
equivalence coefficients are the proper coefficients for determination of
the S.E.M. If the primary purpose is diagnosing the effect of a course in
terms of various attributes, the equivalence coefficient would seem to be the
proper coefficient.



NOTE: The results of this paper can also be used with the new CEQ Form 72 and
Form 73, as they contain items and subscales found in Form 66.




References 

Cronbach, L.J. Coefficient alpha and the internal structure of tests.
     Psychometrika, 1951, 16, 297-334.

Ebel, R.L. Estimation of the reliability of ratings. Psychometrika, 1951,
     16, 407-424.

Eisenhart, C. Expression of uncertainties of final results. Science, 1968,
     160, 1201-1204.

Gillmore, G.M. Approximating decile norms for the Illinois Course Evaluation
     Questionnaire by use of the normal curve. Research Report No. 342.
     Urbana, Illinois: Measurement and Research Division, Office of Instructional
     Resources, University of Illinois, 1972 (mimeo).

Gillmore, G.M. and Stallings, W.M. A note on "accuracy" and "precision."
     Journal of Educational Measurement, 1971, 8, 127-129.

Jensen, A.R. Note on why genetic correlations are not squared. Psychological
     Bulletin, 1971, 75, 223-224.

Nunnally, J.C. Tests and measurements: Assessment and prediction. New York:
     McGraw-Hill, 1959.

Rozeboom, W.W. Foundations of the theory of prediction. Homewood, Illinois:
     The Dorsey Press, 1966.

Spencer, R.E. and Aleamoni, L.M. The Illinois Course Evaluation Questionnaire:
     A description of its development and a report of some of its results.
     Research Report No. 292. Urbana, Illinois: Measurement and Research
     Division, Office of Instructional Resources, 1969 (mimeo).

Treloar, A.E. Correlation Analysis. Minneapolis: Burgess Publishing Co., 1942.






Appendix A 



The Illinois Course Evaluation Questionnaire (Form 66)



ILLINOIS COURSE EVALUATION QUESTIONNAIRE - Form 66

Measurement and Research Division, Office of Instructional Resources, UNIVERSITY OF ILLINOIS. By Richard E. Spencer.



[Form header: identification and response grids, including fields for today's date and the student's college; not legible in the source copy.]

J le.irn mo\e when other tenchi ng methods are used. 


SAMPLE MARKS: 

USE » , 

PENCIL ^ 1 ' 
ONLY , 

RESPONSE CODE:

MARK SA IF YOU STRONGLY AGREE WITH THE ITEM
MARK A IF YOU AGREE MODERATELY WITH THE ITEM
MARK D IF YOU DISAGREE MODERATELY WITH THE ITEM
MARK SD IF YOU STRONGLY DISAGREE WITH THE ITEM

It was a waste of time.

Overall, the course was good.

The textbook was very good.

The instructor seemed to be interested in students as persons.

More courses should be taught this way.

The course held my interest.

I would have preferred another method of teaching in this course.

It was easy to remain attentive.

The instructor did not synthesize, integrate or summarize effectively.

Not much was gained by taking this course.

The instructor encouraged the development of new viewpoints and appreciations.

The course material seemed worthwhile.



It was difficult to remain attentive.



Instructor did not review promptly and in such a way that students could understand their weaknesses. 



Homework assignments were helpful in understanding the course.



There was not enough student participation for this type of course.



The instructor had a thorough knowledge of his subject matter. 


IF PART II OR III IS TO BE USED MARK HERE

COMPLETE SECTIONS BELOW ACCORDING TO YOUR INSTRUCTOR'S DIRECTIONS:

OPTIONAL PART II, ITEMS 51-75

OPTIONAL PART III, ITEMS 76-100

The content of the course was good.

The course increased my general knowledge.

The types of test questions used were good.

Held my attention throughout the course.

The demands of the students were not considered by the instructor.

Uninteresting course.

It was a very worthwhile course.

Some things were not explained very well.









The way in which this course was taught results in better student learning.
The course material was too difficult. 



One of my poorest courses.



Material in the course was easy to follow. 



The instructor seemed to consider teaching as a chore or routine activity.



More outside reading is necessary.



Course material was poorly organized.



Course was not very helpful.


It was quite interesting.

I think that the course was taught quite well.

I would prefer a different method of instruction.

The pace of the course was too slow.

At times I was confused.

Excellent course content.

The examinations were too difficult.

Generally, the course was well organized.

Ideas and concepts were developed too rapidly.

The content of the course was too elementary.





Some days I was not very interested in this course.



It was quite boring. 



The instructor exhibited professional dignity and bearing in the classroom.



Another method of instruction should have been employed.



The course was quite useful. 






I would take another course that was taught this way.

"PERMISSION TO REPRODUCE THIS COPYRIGHTED MATERIAL HAS BEEN GRANTED
TO ERIC AND ORGANIZATIONS OPERATING UNDER AGREEMENTS WITH THE U.S. OFFICE
OF EDUCATION. FURTHER REPRODUCTION OUTSIDE THE ERIC SYSTEM REQUIRES
PERMISSION OF THE COPYRIGHT OWNER."



Optical Scanning Corporation ©



